Mining Parenthetical Translations from the Web by Word Alignment

Authors

  • Dekang Lin
  • Shaojun Zhao
  • Benjamin Van Durme
  • Marius Pasca
Abstract

Documents in languages such as Chinese, Japanese and Korean sometimes annotate terms with their English translations inside a pair of parentheses. We present a method to extract such translations from a large collection of web documents by building a partially parallel corpus and using a word alignment algorithm to identify the terms being translated. The method is able to generalize across the translations for different terms and can reliably extract translations that occur only once in the entire web. Our experiment on Chinese web pages produced more than 26 million translation pairs, which is over two orders of magnitude more than previous results. We show that adding the extracted translation pairs as training data provides a significant increase in the BLEU score of a statistical machine translation system.
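
As a rough illustration of the first stage described in the abstract (collecting candidate text around parenthesized English), the sketch below pairs each English parenthetical with the run of Chinese characters that precedes it. It is not the authors' implementation: the function name `candidate_pairs`, the `max_context` window, and the regular expressions are assumptions made for this example. The word alignment step that decides which part of the preceding text is actually being translated is not shown.

```python
import re

# Hypothetical pattern: ASCII or full-width parentheses enclosing text that
# starts with a Latin letter (an assumption, not the paper's exact filter).
PAREN_RE = re.compile(r"[（(]\s*([A-Za-z][A-Za-z0-9 .,'\-]*?)\s*[）)]")

# Trailing run of CJK Unified Ideographs at the end of a string.
CJK_TAIL_RE = re.compile(r"[\u4e00-\u9fff]+$")


def candidate_pairs(text, max_context=10):
    """Return (preceding Chinese text, English parenthetical) candidates.

    The Chinese side is deliberately over-long: it is only the raw context
    that a later alignment step would have to trim to the translated term.
    """
    pairs = []
    for match in PAREN_RE.finditer(text):
        start = max(0, match.start() - max_context)
        context = text[start:match.start()]
        tail = CJK_TAIL_RE.search(context)
        if tail:
            pairs.append((tail.group(0), match.group(1)))
    return pairs


if __name__ == "__main__":
    sample = "我喜欢机器翻译(machine translation)这个领域。"
    for chinese, english in candidate_pairs(sample):
        print(chinese, "->", english)
```

In the sample sentence the Chinese side of the candidate also contains the words for "I like", which is why a step such as word alignment is needed to isolate the term that the parenthetical actually translates.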

Similar resources

Mining Parenthetical Translations for Polish-English Lexica

Documents written in languages other than English sometimes include parenthetical English translations, usually for technical and scientific terminology. Techniques had been developed for extracting such translations (as well as transliterations) from large Chinese text corpora. This paper presents methods for mining parenthetical translation in Polish texts. The main difference between translati...

Semi-Supervised Lexicon Mining from Parenthetical Expressions in Monolingual Web Pages

This paper presents a semi-supervised learning framework for mining Chinese-English lexicons from a large number of Chinese Web pages. The work is motivated by the observation that many Chinese neologisms are accompanied by their English translations in parentheses. We classify parenthetical translations into bilingual abbreviations, transliterations, and translations. A frequency-ba...

Improved Word Alignments Using the Web as a Corpus

We propose a novel method for improving word alignments in a parallel sentence-aligned bilingual corpus based on the idea that if two words are translations of each other then so should be many words in their local contexts. The idea is formalised using the Web as a corpus, a glossary of known word translations (dynamically augmented from the Web using bootstrapping), the vector space model, li...

The Bilingual Concordancer TransSearch

TRANSSEARCH is a web-based translation search engine. When a user submits a translation query, the system replies with a set of sentence pairs whose source sentence contains the query. The source expression is highlighted and, with the help of statistical word alignment techniques, the corresponding target expression is also identified. When many sentences share the same translations, the trans...

Word-aligned Parallel Text – A New Resource for Contrastive Language Studies

This paper describes the opportunities that arise from automatic word alignment for bilingual concordances and contrastive language studies. We introduce our parallel corpus of Alpine texts in French and German and our web-based alignment search system. We explain how we have reduced the number of erroneous alignments in the output by distinguishing between dominant and miscellaneous translatio...

Publication date: 2008